Clustering with shallow trees 
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We propose a new method for hierarchical clustering based on the optimisation of a cost function 
over trees of limited depth, and we derive a message-passing method that allows to solve it efficiently. 
The method and algorithm can be interpreted as a natural interpolation between two well-known 
approaches, namely single linkage and the recently presented Affinity Propagation. We analyze with 
this general scheme three biological/medical structured datasets (human population based on genetic 
information, proteins based on sequences and verbal autopsies) and show that the interpolation 
k*" technique provides new insight. 



I. INTRODUCTION 



A standard approach to data clustering, that we will also follow here, involves defining a distance measure between 
objects, called dissimilarity. In this context, generally speaking data clustering deals with the problem of classifying 
objects so that objects within the same class or cluster are more similar than objects belonging to different classes. 
The choice of both the measure of similarity and the clustering algorithms are crucial in the sense that they define 
an underlying model for the cluster structure. In this work we discuss two somewhat opposite clustering strategies, 
and show how they nicely fit as limit cases of a more general scheme that we propose. 

Two well-known general approaches that are extensively employed are partitioning methods and hierarchical clus- 
tering methods Partitioning methods are based on the choice of a given number of centroids - i.e. reference 
elements - to which the other elements have to be compared. In this sense the problem reduces to finding a set 
of centroids that minimises the cumulative distance to points on the dataset. Two of the most used partitioning 
algorithms are i^-means (KM) and Affinity Propagation Behind these methods, there is the assumption 

of spherical distribution of data: clusters are forced to be loosely of spherical shape, with respect to the dissimilarity 
metric. These techniques give good results normally only when the structure underlying the data fits this hypothesis. 
Nevertheless, with Soft Affinity Propagation [2j the hard spherical constraint is relaxed, allowing for cluster struc- 
, tures including deviation from the regular shape. This method however recovers partially information on hierarchical 
■ organisation. On the other hand, Hierarchical Clustering methods such as single linkage (SL) [3], starts by defining 
r**"- | a cluster for each element of the system and then proceeds by repeatedly merging the two closest clusters into one. 
i This procedure provides a hierarchic sequence of clusters. 

Recently an algorithm to efficiently approximate optimum spanning trees with a maximum depth D was presented 
in We show here how this algorithm may be used to cluster data, in a method that can be understood as a 
generalisation of both (or rather an interpolation between) the AP and SL algorithms. Indeed in the D = 2 and 
D = n limits - where n is the number of object to cluster - one recovers respectively AP and SL methods. As a 
proof of concept, we apply the new approach to a collection of biological and medical clustering problems on which 
intermediate values of D provide new interesting results. In the next section, we define an objective function for 
clustering based on the cost of certain trees over the similarity matrix, and we devise a message-passing strategy to 
optimise the objective function. The following section is devoted to recovering two known algorithms, AP and SL, 
which are shown to be special cases for appropriately selected values of the external parameters D. Finally, in the 
last section we use the algorithm on three biological/medical data clustering problems for which external information 
can be used to validate the algorithmic performance. First, we cluster human individuals from several geographical 
origins using their genetic differences, then we tackle the problem of clustering homologous proteins based only on 
their amino acid sequences. Finally we consider a clustering problem arising in the analysis causes-of-death in regions 
where vital registration systems are not available. 



II. A COMMON FRAMEWORK 



Let's start with some definitions. Given n datapoints, we introduce the similarity matrix between pairs Sij, where 
i,j G [1, . . . ,n]. This interaction could be represented as a fully connected weighted graph G(n,s) where s is the 
weight associated to each edge. This matrix contitutes the only data input for the clustering methods discussed in 
this manuscript. We refer in the following to the neighbourhood of node i with the symbol di, denoting the ensemble 



FIG. 1: Clustering an artificial 2D image. The black image on the left was randomly sampled and the euclidean distance 
was used as a measure of disimilarity between nodes. Clustering by D-MST was then attempted on the resulting graph. One 
external root vertex v* (red point) was added, with distance A to every other points. The ouput of the algorithm consists in 
a minimum weight rooted spanning tree of depth D pointed out by bold links. The last three figures concern the resulting 
clustering for different choices of the depth limit D — 2, 4, > n respectively. Different clusters with a complex internal structure 
can be recovered after removing the red node v* . In the case of AP D — 2 (second figure) the spherical clusters do not fit the 
ellipsoidal shape of the original figure while for 4-MST (third figure) the structure of two ellipses can be recovered. The fourth 
and last figure corresponds to SL (D > n): in this case nodes are split into two arbitrary components disregarding the original 
shape. 

of all nearest neighbours of i. By adding to the graph G one artificial node v* , called root, whose similarity to all 
other nodes i £ G is a constant parameter A, we obtain a new graph G*(n + 1, a*) where s* is a (n + 1) x (n + 1) 
matrix with one added row and column of constant value to the matrix s (see Figure [I]). 

We will employ the following general scheme for clustering based on trees. Given any tree T that spans all the 
nodes in the graph G* (n + 1, s*), consider the (possibly disconnected) subgraph resulting of removing the root v* 
and all its links. We will define the output of the clustering scheme as the family of vertex sets of the connected 
components of this subgraph. That is, each cluster will be formed by a connected component of the pruned T \v* . 
In the following, we will concentrate on how to produce trees associated to G* . 

The algorithm described in [j| was devised to find a tree of minimum weight with a depth bounded by D from a 
selected root to a set of terminal nodes. In the clustering framework, all nodes are terminals and must be reached by 
the tree. As a tree has exactly n — 1 links, for values of D greater or equal than n the problem becomes the familiar 
(unconstrained) minimum spanning tree problem. In the rest of this section we will describe the D-MST message 
passing algorithm of @ for Steiner trees in the simplified context of (bounded depth) spanning trees. 

To each node of the graph we associate two variables 7rj , and di where 71", £ di could be interpreted as a pointer from 
i to one of the neighbouring nodes j £ di. Meanwhile di £ [0, . . . ,D] is thought as a discrete distance between the 
node i and the root v* along the tree. Necessarily only the root has zero distance d v * = 0, while for all other nodes 
di £ [1, . . . , D]. In order to ensure global connectivity of the D-MST, these two variables must satisfy the following 
condition: 7r$ = j =>• di = dj + 1. This means that if node j is the parent of node i, then the depth of node i must 
exceed the depth of the node j by precisely one. This condition avoids the presence of loops and forces the graph to 
be connected, assigning non-null weight only to configurations corresponding to trees. The energy function thus reads 

E({TTj, = y, gj,7T» ~ 2J (hij(iri,Trj,di,dj) + hji(irj,iri,dj,di)) , (1) 

where is defined as 

j0 { 7 r i =j=^d i = ^ + l} 
J 1 —00 else v ; 

In this way only configurations corresponding to a tree are taken into account with the usual Boltzmann weight factor 
g-Z^i.TTi wnere the external parameter (3 fixes the value of energy level. Thus the partition function is 

Z{P)= £ e-W^,*}) = £ JJe-^ x [I fij , (3) 

{iTi,di} {iTi.di} i ijedi 

where we have introduce an indicator function /,j = gijgji. Each term = 1 — 8 Wi j (l — ^d^di-i) is equivalent to 
e hij while Sij is the delta function. In terms of these quantities fij it is possible to derive the cavity equations, i.e. 
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the following set of coupled equations for the cavity marginal probability Pj_>i(dj, TTj) of each site j S [1, . . . ,n] after 
removing one of the nearest neighbours i € dj: 

Pj^iidj,^) ace-P'^t J| Qk-^jidj,^) (4) 

kedj/i 

Q k ^j(dj,Wj) oc ^ Pk^j(dk,^k)fjk(dj,TTj,d k) TTk) ■ (5) 

These equations are solved iteratively and in graphs with no cycles they are guaranteed to converge to a fixed point 
that is the optimal solution. In terms of cavity probability we are able to compute marginal and joint probability 
distribution using the following relations 

Pj (dj , TTj) OC \[ Q k ^j (dj ,7Tj) (6) 
Pij(di,TTi,dj,TTj) OC Pi^j(di,TTi)Pj^i(dj,TTj)fij(di,1Ti,dj,Trj) . (7) 

For general graphs convergence can be forced by introducing a "reinforcement" perturbation term as in 0, @. This 
leads to a new set of perturbed coupled equations that show good convergence properties. The (3 — > oo limit is taken 
by considering the change of variable ipj-+i(dj,irj) = log Pj—^ (dj , Ttj) and <f>j^i(dj,iVj) — (3~ 1 \ogQj^,i(dj,iTj) 
then the relations [4] and [5] reduce to 

ipj^i (dj ,nj) = -s itni + ^2 (t> k ^j (dj , nj ) (8) 

kedj/i 

(j>k^j(dj,-Kj) = max ip k ^j(d k ,ir k ) . (9) 

These equations are in the Max-Sum form and equalities hold up to some additive constant. In terms of these quantities 
marginals are given by ipj(di, nj) — —Cj Vj + J2k fik^jidj, nj) and the optimum tree is the one obtained by argmax ipj. 
If we introduce the variables • = max^^j V>fc->j(d, nk), Ck->j = V-'fc— j{d 7 j) and E^. = max(C^_ ■, Af. ■) it is 

enough to compute all the messages c6fc-+j (dj , nj) — A k i_^j , E^ 3 ^- for ttj — k and nj ^ k respectively. Using equations 
|S] and [5] we obtain the following set of equations: 



A 



U( t + 1 )= E E Lj(t)+ max Ut-J^-EUj^-Sj*) (10) 



keN(j)/i 

k£N(j)/i 



Cf^(t + l) = -s 3 ,+ ^-(*) ( u ) 

keN(j)/i 

Ef^(t + 1) - max (C£^(t + 1), A^(i + 1)) . (12) 

It has been demonstrated [7J that a fixed point of these equations with depth D > n is an optimal spanning tree. 
In the following two subsections, we show how to recover the SL and AP algorithms. On one hand, by computing 
the (unbounded depth) spanning tree on the enlarged matrix and then considering the connected components of its 
restriction to the set of nodes removing v*, we recover the results obtained by SL. On the other hand we obtain AP 
by computing the D = 2 spanning tree rooted at v*, defining the self-affinity parameter as the weight to reach this 
root node. 



A. Single Linkage limit 

Single Linkage is one of the oldest and simplest clustering methods, and there are many possible descriptions of 
it. One of them is the following: order all pairs according to distances, and erase as many of the pairs with largest 
distance so that the number of resulting connected components is exactly fc. Define clusters as the resulting connected 
components. 

An alternative method consists in removing initially all useless pairs (i.e. pairs that would not change the set 
of components when removed in the above procedure). This reduces to the following algorithm: given the distance 
matrix s, compute the minimum spanning tree on the complete graph with weights given by s. From the spanning 
tree remove the k — 1 links with largest weight. Clusters are given by the resulting connected components. In many 
cases there is no a priori desired number of clusters k and an alternative way of choosing fc is to use a continuous 
parameter A to erase all weights larger than A. 
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The Z5-MST problem for D > n identifies the minimum spanning tree connecting all n + 1 nodes (including the 
root v*). This means each node i will point to one other node 7Tj = j / v* if its weight satisfies the condition 
mim, Si j < Si tV * , otherwise it would be cheaper to connect it to the root (introducing one more cluster). We will make 
this description more precise. For simplicity, let's assume no edge in G(n, s) has weight exactly equal to A. 

The Kruskal algorithm [8j is a classical algorithm to compute a minimum spanning tree. It works by iteratively 
creating a forest as follows: start with a subgraph all nodes and no edges. The scan the list of edges ordered by 
increasing weight, and add the edge to the forest if it connects two different components (i.e. if it does not close a 
loop). At the end of the procedure, it is easy to prove that the forest has only one connected component that forms a 
minimum spanning tree. It is also easy to see that the edges added when applying the Kruskal algorithm to G(n, s) 
up to the point when the weight reaches A are also admitted on the Kruskal algorithm for G(n + 1, s*). After that 
point, the two procedures diverge because on G(n, s) the remaining added edges have weight larger than A while on 
G(n + 1, s*) all remaining added edges have weight exactly A. Summarizing, the MST on G(n + 1, s*) is a MST on 
G(n, s) on which all edges with weight greater than A have been replaced by edges connecting with v* . 



B. Affinity propagation limit 

Affinity Propagation is a method that was recently proposed in Q , based on the choice of a number of "exemplar" 
data-points. Starting with a similarity matrix s, choose a set of exemplar datapoints X C V and an assignment 
(j) : V \— *• X such that: cj)(x) — x if x G X and the sum of the distances between datapoints and the exemplars 
they map to is minimised. It is essentially based on iteratively passing two types of messages between elements, 
representing responsibility and availability. The first, rj_>j, measures how much an element i would prefer to choose 
the target j as its exemplar. The second a^j tells the preference for i to be chosen as an exemplar by datapoint j. 
This procedure is an efficient implementation of the Max-Sum algorithm that improves the naive exponential time 
complexity to Oln 2 ). The self-affinity parameter, namely s^, is chosen as the dissimilarity of an exemplar with 
himself, and in fine regulates the number of groups in the clustering procedure, by allowing more or less points to 
link with "dissimilar" exemplars. 

Given a similarity matrix s for n nodes, we want to identify the exemplars, that is, to find a valid configuration 
?f = {7ri, . . . , Tr n } such that 7r : [1, . . . , n] i— > [1, . . . , n] so as to minimise the function 

n 



i=l 



where the constraint reads 



These equations take into account the only possible configurations, where node i either is an exemplar, meaning 
7Tj = i, or it is not chosen as an exemplar by any other node j. The energy function thus reads 

E ( W \ = I ~ T,i *i,ir« V i { TT t = i U Vj TTj + i } . . 

^ ' 1 oo else v ' 

The cavity equations are computed starting from this definition and after some algebra they reduce to the following 
update conditions for responsibility and availability Q: 

r lt\ = s *,k - max 4- Sfc/.i) (16) 

= min 0, r h _ k + J2 max (0, r^ fc ) ] . (17) 

\ i'^k i 

In order to prove the equivalence between the two algorithms, i.e. D-MST for D = 2 and AP, we show in the 
following how the two employ an identical decomposition of the same energy function thus resulting necessarily to the 
same max sum equations. In the 2-MST equations, we are partitioning all nodes into three groups: the first one is 
only the root whose distance d = 0, the second one is composed of nodes pointing at the root d = 1 and the last one 
is made up of nodes pointing to other nodes that have distance d = 2 from the root. The following relations between 
di and 7Tj makes this condition explicit: 
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FIG. 2: In this figure we compare the results of Single Linkage and Affinity Propagation techniques on a SNP distance dataset. 
The dataset is composed of 269 individuals divided into four populations: CHB (red), CEU (green), YRI (yellow) and JPT 
(blue). The panels A and B are AP results while panels C and D show clusters obtained with SL. As A increases, both 
algorithms fail to divide Chinese from Japanese populations (panels A, C) before splitting the Nigerian population (yellow). 



It is clear that the distance variable di is redundant because the two kind of nodes are perfectly distinguished with 
just the variable 7Tj. Going a step further we could remove the external root v* upon imposing the following condition 
for the pointers -Ki = i 7Tj = v* 7Tj = j ^ i -O- 7tj ^ v*. This can be understood by thinking at AP procedure: since 
nodes at distance one from the root are the exemplars, they might point to themselves, as defined in AP, and all the 
non-exemplars are at distance d — 2 so they might point to nodes at distance d = 1. Using this translation, from 
Equation [2] it follows that 



, i V i {n = i U V j ^ i nj ^ i} , Q , 

« ^ _ 1 -oo else (1J) 



meaning that the constraints are equivalent X^eat = Tin S%(ir). Substituting ()19|) into equation (fTJ) we obtain 

that 

E{{^ d*U) = { ~ Ei ^ = * U V JV ' TS ^ ^ (20) 

The identification of the self affinity parameter and the self similarity s^* = A = allows us to prove the equivalence 
between this formula and the AP energy given in equation Q15p as desired. 



III. APPLICATIONS TO BIOLOGICAL DATA 



In the following sections we shall apply the new technique to different clustering problems and give a preliminary 
comparison to the two extreme limits of the interpolation, namely D — 2 (AP) and D — n (SL). 

Clustering is a widely used method of analysis in biology, most notably in the recents fields of transcriptomics 0] , 
proteomics and genomics (Toj. where huge quantities of noisy data are generated routinely. A clustering approach 
presents many advantages for this type for data: it can use all pre-existing knowledge available to choose group 
numbers and to assign elements to groups, it has good properties of noise robustness[Tl|, and it is computationally 
more tractable than other statistical techniques. In this section apply our algorithm to structured biological data, in 
order to show that by interpolating between two well-known clustering methods (SL and AP) it is possible to obtain 
new insight . 



A. Multilocus genotype clustering 



In this application we used the algorithm to classify individuals according to their original population using only 
information from their sequence SNPs as a distance measure[l2j]. A single-nucleotide polymorphism (SNP) is a DNA 
sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome (or other shared sequence) 
differs between members of a species (or between paired chromosomes in a diploid individual). The dataset we used 
is from the HapMap Project, an international project launched in 2002 with the aim of providing a public resource 
to accelerate medical genetic research [l3|]. It consists of SNPs data from 269 individuals from four geographically 
diverse origins: 90 Utah residents from North and West Europe (CEU), 90 Yoruba of Ibadan, Nigeria (YRI), 45 Han 



FIG. 3: In this figure we report the clustering by D-MST with a fixed maximum depth D = 5. The algorithm gives only one 
mis-classification. 

Chinese of Beijing (CHB) and 44 Tokyo Japanese (JPT). CEU and YRI samples are articulated in thirty families of 
three people each, while CHB and JPT have no such structure. For each individual about 4 millions SNPs are given, 
allocated on different chromosomes. In the original dataset some SNPs were defined only in subpopulations, thus we 
extracted those which were well-defined for every sample in all populations and after this selection the number of 
SNPs for each individual dropped to 1.8 million. We defined the distance between samples by the number of different 
alleles on the same locus among individuals normalised by the total number of counts. The 269 x 269 matrix of 
distance S was defined as follows: 



where N is the number of valid SNPs loci and dij (n) is the distance between the n-th genetic loci of individuals i and 



The resulting distance matrix was given as input to D-MST algorithm. In figure [3] we show the clusters found by 
the algorithm using a maximum depth D = 5. Each individual is represented by a number and coloured according to 
the population he belongs to: green for YRI, yellow for CEU, blue for JPT and red for CHB. One can see that the 
algorithm recognises the populations grouping the individuals in four clusters. There is only one misclassified case, a 
JPT individual placed in the CHB cluster. 

Moreover, noticing that yellow and green clusters have a more regular internal structure than the other two, it is 
possible to consider them separately. Therefore, if one applies the D-MST algorithm to this restricted subset of 
data, all families consisting of three people can be immediately recovered, and the tree subdivides in 60 families of 3 
elements, without any error (details not reported). 

This dataset is particularly hard to classify, due to the complexity of the distances distribution. In fact the presence 
of families creates a sub-clustered structure inside the groups of YRI and CEU individuals. Secondly CHB and JPT 
people, even if they belong to different populations, share in general smaller distances with respect to those subsisting 
among different families inside one of the other two clusters. The Z5-MST algorithm overcomes this subtleties with 
the possibility of developing a complex structure and allows the correct detection of the four populations while other 
algorithms, such as AP, cannot adapt to this variability of the typical distance scale between groups in the dataset. 
Indeed, the hard constraint in AP relies strongly on cluster-shape regularity and forces clusters to appear as star of 
radius one: there is only one central node, and all other nodes are directly connected to it. Elongated or irregular 
multi-dimensional data might have more than one simple cluster centre. In this case AP may force division of single 
clusters into separate ones or may group together different clusters, according to the input self-similarities. Moreover, 
since all data points in a cluster must point to the same exemplar, all information about the internal structure, such 
as families grouping, is lost. Thus, after running AP, CHB and JPT are grouped in a unique cluster, and both CHB 
and JPT have the same exemplar, as shown on figure [2j\_. Going a step further and forcing the algorithm to divide 
Chinese from Japanese we start to split the YRI population (Fig. 03). 




(21) 




(22) 



FIG. 4: We report clusters results obtained using the D-MST algorithm with D = 2 (right) and D — 3 (left), considering only 
CHB and JPT. While both algorithms perform well in this subset, the 3-MST algorithm correctly classifies all individuals. 

Hierarchical Clustering also fails on this dataset, recognising the same three clusters found by Affinity Propagation 
at the 3-clusters level (Fig. [5p) and splitting yellow population in families before dividing blue from red (see figure 
[2JD). This makes sense relative to the typical dissimilarities between individuals, but prevents grasping the entire 
population structure. 

After considering all four populations together, we applied the D-MST algorithm only to the subset consisting 
of CHB and JPT individuals, because these data appeared the hardest to cluster correctly. The result is that the 
D-MST algorithm, with depth D = 3 succeeds in correctly detecting the two clusters without any miss-classification 
as shown in the left part of figured] Limiting the analysis to this selected dataset, Affinity Propagation identifies two 
different clusters, and is unable to divide CHB from JPT, still committing three mis-classifications, as viewed in right 
panel of the figure |U 

The cluster structure found is controlled both by the maximum depth D and by A, the two input parameters. In 
fact, in the extreme case D = 1 all samples in the dataset would be forced to point to the root, representing a single 
cluster. The next step is D = 2 and corresponds to Affinity Propagation. This allows to identify more than trivial 
clusters, as in the previous case, but, as we said, still imposes a strong constraint, which, in general, may be not 
representative of the effective data distribution. Increasing D one can detect more structured clusters, where different 
elements in the same group do not necessary share a strong similarity with the same reference exemplar, as it has 
to be with Affinity Propagation or if-means. On the other hand, the possibility to detect an articulated shape gives 
some information about the presence of eventual internal sub-structures notable to be analysed separately, as in our 
case the partition of two groups into families. 

The parameter A also affects the result of the clustering, in particular on the number of groups found. Assigning a 
large value to this parameter amounts to pay a high cost for every node connected to the root, and so to reduce the 
number of clusters; on the other side decreasing A will create more clusters. In this first application we used a value 
of A comparable with the typical distance between elements, allowing us to detect the four clusters. In this regime, 
one can expect a competition between the tendency of the elements to connect to other nodes or to form new clusters 
with links to the root, allowing the emergence of the underlying structure in the data. 

B. Clustering of proteins datasets 

An important computational problem is grouping proteins into families according to their sequence only. Biological 
evolution lets proteins fall into so-called families of similar proteins - in term of molecular function - thus imposing 
a natural classification. Similar proteins often share the same three-dimensional folding structure, active sites and 
binding domains, and therefore have very close functions. They often - but not necessarily - have a common ancestor, 
in evolutive terms. To predict the biological properties of a protein based on the sequence information alone, one 
either needs to be able to predict precisely its folded structure from its sequence properties, or to assign it to a 
group of proteins sharing a known common function. This second possibility stems almost exclusively from properties 
conserved in through the evolutionary time, and is computationally much more tractable than the first one. We want 
here to underline how our clustering method could be useful to handle this task, in a similar way as the one we used 
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FIG. 5: In the three panels we show the average number of clusters over the random noise as a function of the weight of the 
root for D — 2, 3, 4 respectively. For each graph we show the number of clusters (circle) and the associated F value (square), 
computed as a function of precision and recall. We want to emphasise the fact the highest F values are reached for depth D = 4 
and weight A ~ 1.3. With this choice of the parameters we found the number of clusters is of order 10, a good approximation 
of the number of superfamilies shown in figure as a straight line. 



in the first application, by introducing a notion of distance between proteins based only on their sequences. The 
advantage of our algorithm is its global approach: we do not take into account only distances between a couple of 
proteins at a time, but we solve the clustering problem of finding all families in a set of proteins in a global sense. 
This allows the algorithm to detect cases where related proteins have low sequence identity. 

To define similarities between proteins, we use the BLAST E- value as a distance measure to assess whether a given 
alignment between two different protein sequences constitutes evidence for homology. This classical score is computed 
by comparing how strong an alignment is with respect to what is expected by chance alone. This measure accounts for 
the length of the proteins, as long proteins have more chance to randomly share some subsequence. In essence, if the 
E- value is the match is perfect while the more E- value is high the more the average similarity of the two sequences is 
low and can be considered as being of no evolutionary relevance. We perform the calculation in a all-by-all approach 
using the BLAST program, a sequence comparison algorithm introduced by Altshul et al. [fH ], 

Using this notion of distance between proteins we are able to define a matrix of similarity s, in which each entry Sij 
is associated to the E- value between protein i and j. The D-MST algorithm is then able to find the directed tree 
between all the sets of nodes minimising the same cost function as previously. The clusters we found are compared to 
those computed by other clustering methods in the literature, and to the "real" families of function that have been 
identified experimentally. 

As in the work by (Taj, we use the Astral 95 compendium of SCOP database [l6| where no two proteins share more 
than 95% of similarity, so as not to overload the clustering procedure with huge numbers of very similar proteins that 
could easily be attributed to a cluster by direct comparison if necessary. As this dataset is hierarchically organised, 
we choose to work at the level of superfamilies, in the sense that we want identify, on the basis of sequence content, 
which proteins belong to the same superfamily. Proteins belonging to the same superfamily are evolutionary related 
and share functional properties. Before going into the detail of the results we want to underline the fact that we do 
not modify our algorithm to adapt to this dataset structure, and without any prior assumption on the data, we are 
able to extract interesting information on the relative size and number of clusters selected (Figj6]). Notably we do 
not use a training set to optimise a model of the underlying cluster structure, but focus only on raw sequences and 
alignments. 

One issue that was recently put forward is the alignment variability [l7| depending on the algorithms employed. 
Indeed some of our results could be biased by errors or dependence of the dissimilarity matrix upon the particular 
details of the alignments that are used to compute distances, but in the framework of a clustering procedure these 
small-scale differences should stay unseen due to the large scale of the dataset. On the other hand, the great advantage 
of working only with sequences is the opportunity to use our method on datasets where no structure is known a priori, 
such as fast developing metagenomics datasets [18fl . We choose as a training set 5 different superfamilies belonging 
to the ASTRAL 95 compendium for a total number of 661 proteins: a) Globin-like, b) EF-hand, c) Cupredoxin, 
d) Trans- Glycosidases and e) Thioredoxin-like. Our algorithm is able to identify a good approximation on the real 
number of clusters. Here we choose the parameter A well above the typical weight between different nodes, so as to 
minimise the number of groups found. As a function of this weight you can see the number of clusters found by the 
D-MST algorithm reported in figure 03 for the depths D = 2,3,4. In these three plots we see the real value of the 
number of clusters is reached for different values of the weight A ~ 12,2,1.4 respectively. The performance of the 
algorithm can be analysed in terms of precision and recall. These quantities are combined in the F-value 15( defined 
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FIG. 6: We show the results of clustering proteins of the 5 subfamilies Globin-like (Gl), EFhand (EF), Cupredoxin (Cu), 
Trans- Glycosidases (TG), Thioredoxin-like (Th) using 4-MST with parameter A=1.45. We see that most of the proteins of the 
first three families (Gl, EF and Cu) are correctly grouped together respectively in cluster 4, 1 and 3 while the last two families 
are identified with clusters 2 and 5 with some difficulties. 
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FIG. 7: We plot the F value for depths D — 2, 3, 4, 5 as a function of the number of clusters found by the D-MST algorithm. The 
case D = 2 provides the AP results while D > n is associated to SL and gives value well below 0.4. The highest performance 
in terms of the F value is reached for depth D — 4 and number of clusters ~ 10. We draw a line in correspondence to the 
presumed number of clusters which is 5 where again the algorithm with parameter D = 4 obtains the highest performance 
score. 



as 

I 2 n h 

F = —}n h max , 1 , (23) 

h 

where rij is the number of nodes in cluster i according to the classification A we find with the D-MST algorithm, n h is 
the number of nodes in the cluster h according to the real cluster classification K and is the number of predicted 
proteins in the cluster i and at the same time in the cluster h. In both cases the algorithm performs better results 
for lower value of A. This could be related to the definition of the F value because starting to reduce the number of 
expected clusters may be misleading in the accuracy of the predicted data clustering. 

Since distances between datapoints have been normalised to be real numbers between to 1, when A — > oo we 
expect to find the number of connected components of the given graph G(n, s). While lowering this value, we start 
to find some configurations which minimise the weight respect to the single cluster solution. The role played by the 
external parameter A could be seen as the one played by a chemical potential tuning from outside the average number 
of clusters. 

We compare our results to the ones in [l5| for different algorithms and it is clear that intermediate values of D 
gives best results on the number of clusters detected and on the F-value reached without any a priori treatment of 
data. It is also clear that D-MST algorith mwith D = 3, 4, 5 gives better results than AP (case D — 2) as can be seen 
in Fig. Ui 
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FIG. 8: Cluster decomposition broken down by cause-of-death (from 1 to 23) produced by AP (blue) and D-MST (green). The 
parameter A is chosen from the stable region, where the number of clusters is constant. 



We believe that the reason is that clusters do not have an intrinsic spherical regularity. This may be due to the 
fact that two proteins having a high number of differences between their sequences at irrelevant sites can be in the 
same family. Such phenomena can create clusters with complex topologies in the sequence space, hard to recover with 
methods based on a spherical shape hypothesis. We compute the F-value also in the single linkage limit (D > n) and 
its value is almost ~ 0.38 in all the range of clusters detected. This shows that the quality of the predicted clusters 
improves reaching the highest value when D — 4 and then decreases when the maximum depth increases. 



C. Clustering of verbal autopsy data 

The verbal autopsy is an important survey-based approach to measuring cause-specific mortality rates in populations 
for which there is no vital registration system pjl, [20j . We applied our clustering method to the results of 2039 
questionnaires in a benchmark verbal autopsy dataset, where gold-standard cause-of-death diagnosis is known for 
each individual. Each entry in the dataset is composed of responses 47 yes/no /don't know questions. 

To reduce the effect of incomplete information, we restricted our analysis to the responses for which at least 91% 
of questions answered yes or no (in other words, at most 9% of the responses were "don't know"). This leaves 743 
responses to cluster (see for a detailed descriptive analysis of the response patterns in this dataset.) 

The goal of clustering verbal autopsy responses is to infer the common causes of death on the basis of the answers. 
This could be used in the framework of "active learning" , for example, to identify which verbal autopsies require 
further investigation by medical professionals. 

As in the previous applications, we define a distance matrix on the verbal autopsy data and apply D-MST with 
different depths D. The questionnaires are turned into vectors by associating to the answers yes/no/don't know the 
values 0/1/0.5 respectively. The similarity matrix is then computed as the root mean square difference between vectors, 

dij = ^"yE* (si(k) — Sj(fc)) 2 , where Sj(fc) £ {0, 1,0.5} refers to the symptom k e [0,47] in the i-th questionnaire. 

We first run 2-MST (AP) and 4-MST on the dataset and find how the number of clusters depend on A. We identify 
a stable region which corresponds to 3 main clusters for both D = 2,4. As shown in figure [51 to each cluster we can 
associate a different causes of death. Cluster 1 contains nearly all of the Ischemic Heart Disease deaths (cause 5) and 
about half of the Diabetes Mellitus deaths (cause 6). Cluster 2 contains most of the Lung Cancer deaths (cause 13) 
and Chronic Obstructive Pulmonary Disease deaths (cause 15). Cluster 2 also contains most of the additional IHD 
and DM deaths (30% of all deaths in the dataset are due to IHD and DM) . Cluster 3 contains most of the Liver Cancer 
deaths (cause 11) as well as most of the Tuberculosis deaths (cause 2) and some of the other prevalent causes. For 
D = 2 we find no distinguishable hierarchical structure in the 3 clusters, while for higher value we find a second-level 
structure. In particular for D = 4 we obtain 57-60 subfamilies for value of A in the region of 0.15 — 0.20. Although 
the first-level analysis (Fig|5j3) underlines the similarity of D-MST algorithm with AP, increasing the depth leads to 
a finer sub-clusters decomposition [21| . 
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D. Conclusion 

We introduced a new clustering algorithm which naturally interpolates between partitioning methods and hierar- 
chical clustering. The algorithm is based on the cavity method and finds a bounded depth D spanning tree on a 
graph G(V, E) where V is the set of n vertices identified with the data points plus one additional root node and E is 
the set of edges with weights given by the dissimilarity matrix and by a unique distance A from the root node. The 
limits D = 2 and D = n reduce to the well known AP and SL algorithms. The choice of A determines the number 
of clusters. Here we have adopted the same criterion as in ref. [3]: the first non trivial clustering occurs when the 
cluster number is constant for a stable region of A values. 

Preliminary applications on three different biological datasets have shown that it is indeed possible to exploit the 
deviation from the purely D = 2 spherical limit to gain some insight into the data structures. Our method has 
properties which are of generic relevance for large scale datasets, namely scalability, simplicity and parallelizability. 
Work is in progress to systematically apply this technique to real world data. 
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